Maverik, a retail convenience chain, fuels adventures in more than 380 locations across 12 western states. Maverik is known for premium BonFire food, diesel and unleaded fuel, and in-store merchandise. The company is on an expansion spree, and as part of its growth it recently acquired "Kum & Go", nearly doubling its store count.
Problem statement
The company plans to open 30 new stores yearly in new markets, and it needs to forecast daily sales for each new store. This is crucial for financial planning and return on investment (ROI) assessment for these new stores. The challenge lies in accurately predicting first-year sales for newly opened stores.
We have a limited amount of data; therefore, we will use the currently available historical dataset. The analytical problems we will address in this project are identifying trends and patterns in sales data, understanding the relationship between sales and other factors, and identifying outliers and anomalies. By addressing these analytical problems, we will be able to build a sales forecasting model.
Solution
We will develop a sales forecasting model that can accurately predict daily sales for the new stores. We will use a variety of forecasting methods, such as time series analysis, regression analysis, causal analysis, and machine learning. The model will be evaluated by comparing its predictions to actual sales data from a holdout dataset, using industry-standard forecast accuracy metrics (e.g., MAE, MAPE, RMSE), along with its ability to update forecasts dynamically in response to new data.
This sales forecasting model will help the company to make informed decisions about staffing, inventory, and marketing for the new stores. This can help the company to maximize its profits and minimize its losses while maintaining customer satisfaction.
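As a concrete reference for the evaluation step, below is a minimal sketch of how these accuracy metrics could be computed on a holdout set; the `actual` and `predicted` arrays are hypothetical placeholders, not project data.
# Hedged sketch: MAE, MAPE, and RMSE on a hypothetical holdout sample
import numpy as np
actual = np.array([2100.0, 1950.0, 2300.0, 1800.0])          # illustrative holdout sales
predicted = np.array([2000.0, 2050.0, 2250.0, 1700.0])       # illustrative model forecasts
mae = np.mean(np.abs(actual - predicted))                    # Mean Absolute Error
mape = np.mean(np.abs((actual - predicted) / actual)) * 100  # Mean Absolute Percentage Error
rmse = np.sqrt(np.mean((actual - predicted) ** 2))           # Root Mean Squared Error
print(f'MAE: {mae:.2f}, MAPE: {mape:.2f}%, RMSE: {rmse:.2f}')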
capital_projects.soft_opening_date : The date this store opened
calendar.calendar_day_date : date
calendar.fiscal_week_id_for_year : Fiscal Week Number
calendar.day_of_week : Day of the week
calendar_information.holiday : Holiday information for that date
calendar_information.type_of_day : Day type
daily_yoy_ndt.total_inside_sales : Inside sales, everything that isn't made at the store
daily_yoy_ndt.total_food_service : Food service sales, everything that is made at the store
diesel : Diesel gallons sold
unleaded : All non-diesel (unleaded) gallons sold
site_id_msba : Unique site key
Based on the information given to us thus far, the key column for later merging the train and test tables will be:
timeseries --> site_id_msba
qualitative_data --> site_id_msba
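For illustration, once both tables are loaded below, the eventual merge on this key could look like the following sketch (the left join is an assumption; the actual join strategy will depend on the modeling setup).
# Hedged sketch: joining store-level features onto the daily time series on the shared key
merged = timeseries.merge(store_features, on='site_id_msba', how='left')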
Loading our datasets and adding all necessary packages for the project.
#Mounting Google Drive
from google.colab import drive
drive.mount('/content/gdrive')
Mounted at /content/gdrive
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import numpy as np
from itertools import combinations
import warnings
warnings.filterwarnings('ignore')
#Loading and renaming datasets for easy use. Qualitative is named as store_features
timeseries = pd.read_csv('/content/gdrive/MyDrive/Final Capstone - MSBA/Maverik/time_series_data_msba.csv')
store_features = pd.read_csv('/content/gdrive/MyDrive/Final Capstone - MSBA/Maverik/qualitative_data_msba.csv')
# Displays information about the DataFrame
timeseries.info()
#Top 5 Rows of Dataset
display(timeseries.head())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13908 entries, 0 to 13907
Data columns (total 12 columns):
 #   Column                              Non-Null Count  Dtype
---  ------                              --------------  -----
 0   Unnamed: 0                          13908 non-null  int64
 1   capital_projects.soft_opening_date  13908 non-null  object
 2   calendar.calendar_day_date          13908 non-null  object
 3   calendar.fiscal_week_id_for_year    13908 non-null  int64
 4   calendar.day_of_week                13908 non-null  object
 5   calendar_information.holiday        13908 non-null  object
 6   calendar_information.type_of_day    13908 non-null  object
 7   daily_yoy_ndt.total_inside_sales    13908 non-null  float64
 8   daily_yoy_ndt.total_food_service    13908 non-null  float64
 9   diesel                              13908 non-null  float64
 10  unleaded                            13908 non-null  float64
 11  site_id_msba                        13908 non-null  int64
dtypes: float64(4), int64(3), object(5)
memory usage: 1.3+ MB
| Unnamed: 0 | capital_projects.soft_opening_date | calendar.calendar_day_date | calendar.fiscal_week_id_for_year | calendar.day_of_week | calendar_information.holiday | calendar_information.type_of_day | daily_yoy_ndt.total_inside_sales | daily_yoy_ndt.total_food_service | diesel | unleaded | site_id_msba | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2022-06-14 | 2022-06-17 | 25 | Friday | NONE | WEEKDAY | 2168.2920 | 861.6930 | 722.7745 | 1425.1020 | 24535 |
| 1 | 2 | 2022-06-14 | 2022-06-22 | 25 | Wednesday | NONE | WEEKDAY | 2051.5635 | 808.0275 | 730.4850 | 1436.2740 | 24535 |
| 2 | 3 | 2022-06-14 | 2022-06-23 | 25 | Thursday | NONE | WEEKDAY | 2257.5000 | 966.4410 | 895.7970 | 1594.3725 | 24535 |
| 3 | 4 | 2022-06-14 | 2022-06-26 | 26 | Sunday | NONE | WEEKEND | 1520.5925 | 542.3250 | 584.2900 | 1124.9280 | 24535 |
| 4 | 5 | 2022-06-14 | 2022-06-27 | 26 | Monday | NONE | WEEKDAY | 1897.6930 | 771.4525 | 852.2605 | 1640.2540 | 24535 |
Starting position of the timeseries dataset: 12 columns, 13908 rows; dtypes: float64(4), int64(3), object(5). The "Unnamed" column duplicates the row index, so it is dropped below.
timeseries = timeseries.drop(timeseries.columns[0], axis = 1)
#Dataset overview after dropping unnamed column.
timeseries.head()
| capital_projects.soft_opening_date | calendar.calendar_day_date | calendar.fiscal_week_id_for_year | calendar.day_of_week | calendar_information.holiday | calendar_information.type_of_day | daily_yoy_ndt.total_inside_sales | daily_yoy_ndt.total_food_service | diesel | unleaded | site_id_msba | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022-06-14 | 2022-06-17 | 25 | Friday | NONE | WEEKDAY | 2168.2920 | 861.6930 | 722.7745 | 1425.1020 | 24535 |
| 1 | 2022-06-14 | 2022-06-22 | 25 | Wednesday | NONE | WEEKDAY | 2051.5635 | 808.0275 | 730.4850 | 1436.2740 | 24535 |
| 2 | 2022-06-14 | 2022-06-23 | 25 | Thursday | NONE | WEEKDAY | 2257.5000 | 966.4410 | 895.7970 | 1594.3725 | 24535 |
| 3 | 2022-06-14 | 2022-06-26 | 26 | Sunday | NONE | WEEKEND | 1520.5925 | 542.3250 | 584.2900 | 1124.9280 | 24535 |
| 4 | 2022-06-14 | 2022-06-27 | 26 | Monday | NONE | WEEKDAY | 1897.6930 | 771.4525 | 852.2605 | 1640.2540 | 24535 |
timeseries.dtypes
capital_projects.soft_opening_date     object
calendar.calendar_day_date             object
calendar.fiscal_week_id_for_year        int64
calendar.day_of_week                   object
calendar_information.holiday           object
calendar_information.type_of_day       object
daily_yoy_ndt.total_inside_sales      float64
daily_yoy_ndt.total_food_service      float64
diesel                                float64
unleaded                              float64
site_id_msba                            int64
dtype: object
Which features are categorical?
These values classify the samples into sets of similar samples. Categorical values may be nominal or ordinal. Among other things, this helps us select the appropriate plots for visualization.
Categorical feature : 5
['capital_projects.soft_opening_date', 'calendar.calendar_day_date', 'calendar.day_of_week', 'calendar_information.holiday', 'calendar_information.type_of_day']
Which features are numerical?
These values change from sample to sample. Within numerical features are the values discrete, continuous, or timeseries based. Among other things this helps us select the appropriate plots for visualization.
Numerical feature : 6
Discrete feature : 0
Continuous feature : 6
['calendar.fiscal_week_id_for_year', 'daily_yoy_ndt.total_inside_sales', 'daily_yoy_ndt.total_food_service', 'diesel', 'unleaded', 'site_id_msba']
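The feature lists above can be reproduced from the column dtypes; the split below is a minimal sketch and assumes the counts were generated by dtype (object columns as categorical, numeric columns as numerical).
# Hedged sketch: splitting columns into categorical and numerical by dtype
categorical_features = timeseries.select_dtypes(include=['object']).columns.tolist()
numerical_features = timeseries.select_dtypes(include=['number']).columns.tolist()
print('Categorical feature :', len(categorical_features))
print(categorical_features)
print('Numerical feature :', len(numerical_features))
print(numerical_features)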
timeseries['capital_projects.soft_opening_date'] = pd.to_datetime(timeseries['capital_projects.soft_opening_date'])
timeseries['calendar.calendar_day_date'] = pd.to_datetime(timeseries['calendar.calendar_day_date'])
timeseries.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13908 entries, 0 to 13907
Data columns (total 11 columns):
 #   Column                              Non-Null Count  Dtype
---  ------                              --------------  -----
 0   capital_projects.soft_opening_date  13908 non-null  datetime64[ns]
 1   calendar.calendar_day_date          13908 non-null  datetime64[ns]
 2   calendar.fiscal_week_id_for_year    13908 non-null  int64
 3   calendar.day_of_week                13908 non-null  object
 4   calendar_information.holiday        13908 non-null  object
 5   calendar_information.type_of_day    13908 non-null  object
 6   daily_yoy_ndt.total_inside_sales    13908 non-null  float64
 7   daily_yoy_ndt.total_food_service    13908 non-null  float64
 8   diesel                              13908 non-null  float64
 9   unleaded                            13908 non-null  float64
 10  site_id_msba                        13908 non-null  int64
dtypes: datetime64[ns](2), float64(4), int64(2), object(3)
memory usage: 1.2+ MB
These lines of code convert the 'capital_projects.soft_opening_date' and 'calendar.calendar_day_date' columns from object to the datetime data type. This conversion allows for more effective handling and analysis of date-related data.
timeseries= timeseries.set_index('calendar.calendar_day_date')
This line of code sets the 'calendar.calendar_day_date' column as the index of the 'timeseries' DataFrame. The date column is then used as the row labels, which is helpful for time series analysis and for accessing data by date.
timeseries['Year'] = timeseries.index.year
timeseries['Month'] = timeseries.index.month
timeseries['Day'] = timeseries.index.day
timeseries['DayOfYear'] = timeseries.index.dayofyear
These lines of code create new columns in the 'timeseries' DataFrame: 'Year', 'Month', 'Day', and 'DayOfYear'.
#Creating function to calculate season
def map_month_to_season(month):
if month in [12, 1, 2]:
return 'Winter'
elif month in [3, 4, 5]:
return 'Spring'
elif month in [6, 7, 8]:
return 'Summer'
else:
return 'Fall'
#Adds "season" column to the timeseries dataset
timeseries['season'] = timeseries.index.month.map(map_month_to_season)
This code defines a function map_month_to_season that takes a month (numeric) as input and returns a corresponding season as a string. Seasons are categorized as 'Winter' (December, January, February), 'Spring' (March, April, May), 'Summer' (June, July, August), and 'Fall' (September, October, November).
timeseries.sort_index(inplace=True)
The code sorts the 'timeseries' DataFrame by its date index in ascending order, so the rows are arranged chronologically.
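With the datetime index sorted, rows can now be selected directly by date; for example (illustrative dates):
# Hedged sketch: date-based selection enabled by the sorted datetime index
june_2022 = timeseries.loc['2022-06']      # all rows (across sites) in June 2022
one_day = timeseries.loc['2022-06-17']     # all rows on a single date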
#Overview of the dataset
timeseries.head()
| capital_projects.soft_opening_date | calendar.fiscal_week_id_for_year | calendar.day_of_week | calendar_information.holiday | calendar_information.type_of_day | daily_yoy_ndt.total_inside_sales | daily_yoy_ndt.total_food_service | diesel | unleaded | site_id_msba | Year | Month | Day | DayOfYear | season | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| calendar.calendar_day_date | |||||||||||||||
| 2021-01-12 | 2021-01-12 | 2 | Tuesday | NONE | WEEKDAY | 2036.2685 | 762.8530 | 1424.1850 | 1522.0030 | 21560 | 2021 | 1 | 12 | 12 | Winter |
| 2021-01-13 | 2021-01-12 | 2 | Wednesday | NONE | WEEKDAY | 2379.8880 | 1003.7930 | 1303.8445 | 1853.7715 | 21560 | 2021 | 1 | 13 | 13 | Winter |
| 2021-01-14 | 2021-01-12 | 2 | Thursday | NONE | WEEKDAY | 2435.8600 | 974.2250 | 1375.6785 | 2122.4070 | 21560 | 2021 | 1 | 14 | 14 | Winter |
| 2021-01-15 | 2021-01-12 | 3 | Friday | NONE | WEEKDAY | 2805.9780 | 911.0115 | 1334.9175 | 2267.9930 | 21560 | 2021 | 1 | 15 | 15 | Winter |
| 2021-01-16 | 2021-01-12 | 3 | Saturday | NONE | WEEKEND | 2314.7635 | 715.7535 | 831.1625 | 1819.6395 | 21560 | 2021 | 1 | 16 | 16 | Winter |
timeseries.nunique()
capital_projects.soft_opening_date       32
calendar.fiscal_week_id_for_year         52
calendar.day_of_week                      7
calendar_information.holiday             26
calendar_information.type_of_day          2
daily_yoy_ndt.total_inside_sales      13791
daily_yoy_ndt.total_food_service      13529
diesel                                13775
unleaded                              13812
site_id_msba                             38
Year                                      3
Month                                    12
Day                                      31
DayOfYear                               365
season                                    4
dtype: int64
The table above displays the number of unique values in each column. It indicates that data is provided for 38 site IDs spanning three calendar years, with 26 distinct holiday labels (including 'NONE' for non-holidays) and two types of day (weekday and weekend).
#Function to calculate percent null
def null_values(dataframe):
return dataframe.isnull().sum()/dataframe.shape[0]*100
#Calculated null values
null_values(timeseries)
capital_projects.soft_opening_date    0.0
calendar.fiscal_week_id_for_year      0.0
calendar.day_of_week                  0.0
calendar_information.holiday          0.0
calendar_information.type_of_day      0.0
daily_yoy_ndt.total_inside_sales      0.0
daily_yoy_ndt.total_food_service      0.0
diesel                                0.0
unleaded                              0.0
site_id_msba                          0.0
Year                                  0.0
Month                                 0.0
Day                                   0.0
DayOfYear                             0.0
season                                0.0
dtype: float64
There are no null values in this dataset.
# Function to count high-side outliers in a column
# (flags only values above mean + 3.5 standard deviations)
def count_outliers(column):
    mean_value = column.mean()
    std_dev = column.std()
    threshold = 3.5 * std_dev
    is_outlier = (column > mean_value + threshold)
    return np.sum(is_outlier)
# Filter out object type columns
numerical_data = timeseries.select_dtypes(include=[np.number])
# Initialize a list to store results
results = []
# Calculate and store the count of values and outliers for each numerical column
for col in numerical_data.columns:
num_values = numerical_data[col].count()
num_outliers = count_outliers(numerical_data[col])
results.append([col, num_values, num_outliers])
# Create a DataFrame from the results
results_df = pd.DataFrame(results, columns=["Column Name", "Values", "Outliers"])
# sorted_results_df = results_df.sort_values(by='Outliers', ascending=False)
# sorted_results_df
print(results_df)
                        Column Name  Values  Outliers
0  calendar.fiscal_week_id_for_year   13908         0
1  daily_yoy_ndt.total_inside_sales   13908        72
2  daily_yoy_ndt.total_food_service   13908       116
3                            diesel   13908       230
4                          unleaded   13908        86
5                      site_id_msba   13908         0
6                              Year   13908         0
7                             Month   13908         0
8                               Day   13908         0
9                         DayOfYear   13908         0
The output above indicates that all four target sales variables contain high-side outliers (beyond 3.5 sigma), with the counts shown. The percentage of outliers is minimal (< 2%), so we have planned to retain them for data exploration and analysis.
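The "< 2%" figure can be verified directly from the counts above; a small sketch:
# Hedged sketch: expressing the outlier counts as a percentage of all rows
results_df['% Outliers'] = results_df['Outliers'] / results_df['Values'] * 100
print(results_df.sort_values('% Outliers', ascending=False))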
# Heatmap of correlation between numeric features (object columns excluded)
plt.figure(figsize = (10,6))
sns.heatmap(timeseries.select_dtypes(include='number').corr(), annot=True)
plt.title('Correlation Heatmap')
plt.show()
This heatmap shows correlations between the numeric features in the timeseries dataset. Food service and inside sales are highly correlated at 0.88, and there is a moderate correlation between diesel and food service sales. This will help us understand the relationships between our features and how to approach them.
#Filtering target variables
target_var= timeseries[['daily_yoy_ndt.total_food_service','daily_yoy_ndt.total_inside_sales','diesel','unleaded']]
# Calculate the correlation matrix
correlation_matrix = target_var.corr()
# Create a beautiful correlation grid with correlation factors
plt.figure(figsize=(4, 3))
sns.set(font_scale=1.2)
sns.set_style("whitegrid")
# Use a diverging color palette for better visualization
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Create the correlation heatmap with annotated correlation values
sns.heatmap(correlation_matrix, annot=True, cmap=cmap, vmin=-1, vmax=1, center=0, square=True, fmt=".2f",
linewidths=.5, cbar_kws={"shrink": 0.6}, annot_kws={"size": 14})
plt.title("Correlation Matrix", fontsize=16)
plt.show()
The correlation matrix above reveals significant associations between various factors. Specifically, it highlights a strong correlation of 0.88 between food service and inside sales, suggesting that as food sales increase, so do inside sales. Additionally, there is a moderate correlation of 0.55 between diesel and food service, indicating some degree of connection between these two variables. On the other hand, inside sales and unleaded fuel exhibit a weaker correlation of 0.29, while diesel and unleaded have the weakest correlation at 0.14.
# Get all combinations of the target variable names (4 choose 2)
target_combinations = list(combinations(target_var.columns, 2))
# Define a color palette with different colors for each plot
colors = sns.color_palette('husl', len(target_combinations))
# Create a 4x4 grid of scatter plots with trend lines and different colors
plt.figure(figsize=(18, 18))
for i, (var1, var2) in enumerate(target_combinations, start=1):
plt.subplot(4, 4, i)
plt.scatter(target_var[var1], target_var[var2], label='Data', color=colors[i - 1])
# Calculate the linear regression coefficients
coeffs = np.polyfit(target_var[var1], target_var[var2], 1)
x_range = np.array([min(target_var[var1]), max(target_var[var1])])
trend_line = coeffs[0] * x_range + coeffs[1]
# Calculate the correlation coefficient
correlation = np.corrcoef(target_var[var1], target_var[var2])[0, 1]
# Add the trend line with correlation value to the plot
    plt.plot(x_range, trend_line, color='red', label=f'Trend Line (Correlation={correlation:.2f})')
plt.xlabel(var1)
plt.ylabel(var2)
plt.title(f'Scatter Plot: {var1} vs. {var2}')
plt.legend()
plt.tight_layout()
plt.show()
This code has been designed to create a grid of scatter plots for the purpose of visually exploring and comparing relationships between pairs of target variables. Within these plots, trend lines have been incorporated to emphasize any linear trends present in the data, with different colors employed to distinguish between the individual plots.
The trend lines quantify these relationships: food service vs. inside sales shows a strong positive correlation (0.88); food service vs. diesel is moderate (0.55); food service vs. unleaded is weak (0.19); inside sales vs. diesel is moderate (0.40); inside sales vs. unleaded is weak (0.29); and unleaded vs. diesel is very weak (0.14). There are no negative correlations among our target variables.
# Create a figure and axis
plt.figure(figsize=(8, 3))
# Loop through each variable and plot its density curve
for variable in target_var:
sns.kdeplot(timeseries[variable], shade=True, label=variable)
# Add labels and a title
plt.xlabel('Variable Values')
plt.ylabel('Density')
plt.title('Density Plot of various sales')
# Add a legend to distinguish the variables
plt.legend()
# Show the plot
plt.show()
Food Services Curve - Narrower Plot (Less Variability): When the density plot for food services has a narrower shape, it implies that most data points are concentrated closely around a central range. In other words, there is less spread or dispersion in food services data. This suggests that the values for food services tend to be relatively consistent or clustered around a specific level.
Diesel and Unleaded Sales - Wider Plot (Greater Variability): A wider density plot for Diesel and unleaded sales indicates that these variables have more variability. Data points are more spread out across a wider range of values. This suggests that there is a greater diversity in the values of Diesel and unleaded sales, with some data points being significantly higher or lower than the central tendency.
Inside Sales - Moderate Variability: Inside sales fall in between, showing moderate variability. This suggests that while inside sales data points do spread out more than food services, they are not as widely dispersed as Diesel and unleaded sales.
The specific characteristics of these distributions:
Food Sales High Peak (High-Density Values): The prominent peak in the food sales curve indicates that there is a high concentration of data points around a particular value. This central value is associated with high-density values, meaning it's the most common or frequently observed value for food sales.
Diesel and Unleaded Sales Skewed to the Right (Rightward Skew): When Diesel and unleaded sales plots are skewed to the right, it means that there is a longer tail of data points extending towards the higher values. In other words, there are a few data points with much higher sales values that pull the distribution to the right. This suggests that there may be outliers or exceptionally high sales in these categories.
Inside Sales Normally Distributed with Slight Right Skew: Inside sales appear to follow a nearly normal distribution, which is a symmetrical distribution with a characteristic bell-shaped curve. The slight right skew suggests a minor bias towards higher sales values but is not as pronounced as in Diesel and unleaded sales.
In summary, the density plot provides a comprehensive view of the spread, central tendencies, and skewness of the data for different target variables. It helps you understand the variability and distribution characteristics of each variable, enabling more informed data analysis and decision-making.
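The skew described above can also be quantified numerically; in the sketch below, positive values correspond to the right-tailed shapes in the density plot.
# Hedged sketch: quantifying the skewness of each target variable
print(target_var.skew())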
Below we look into sales growth in the initial period for four selected stores (those with the most daily records in the data).
# Select the four site IDs with the most daily records
top_4_site_ids = timeseries['site_id_msba'].value_counts()[:4].index.tolist()
time_series_data_filtered = timeseries[timeseries['site_id_msba'].isin(top_4_site_ids)].copy()
# Sort the data chronologically (calendar.calendar_day_date is the index)
time_series_data_filtered = time_series_data_filtered.sort_index()
# Calculate food service growth rates per site, so shifts never cross store boundaries
food_by_site = time_series_data_filtered.groupby('site_id_msba')['daily_yoy_ndt.total_food_service']
time_series_data_filtered['growth_rate_7d'] = food_by_site.pct_change(7)
time_series_data_filtered['growth_rate_30d'] = food_by_site.pct_change(30)
time_series_data_filtered['growth_rate_6m'] = food_by_site.pct_change(180)
time_series_data_filtered['growth_rate_1y'] = food_by_site.pct_change(365)
# Set a consistent style for all charts
sns.set_style('whitegrid')
# Create a figure with adjusted size
fig = plt.figure(figsize=(14, 10))
# Define a set of colors to cycle through
colors = ['blue', 'green', 'orange', 'red']
# Create a subplot for each time period
for i, time_period in enumerate(['7 days', '30 days', '6 months', '1 year']):
ax = fig.add_subplot(2, 2, i + 1)
# Plot the data for each site ID with a different color
for j, site_id in enumerate(top_4_site_ids):
site_data = time_series_data_filtered[time_series_data_filtered['site_id_msba'] == site_id]
ax.plot(site_data.index, site_data['daily_yoy_ndt.total_food_service'], label=f'Site {site_id}', color=colors[j % len(colors)]) # Use modulo to cycle through colors
# Add a grid to the chart
ax.grid(True)
# Set the axis limits so that all data points are visible
ax.set_xlim(time_series_data_filtered.index.min(), time_series_data_filtered.index.max())
ax.set_ylim(0, time_series_data_filtered['daily_yoy_ndt.total_food_service'].max()) # Adjust y-axis limit
# Add a title and labels to the chart
ax.set_title(f'Daily Food Service Sales for {time_period}')
ax.set_xlabel('Date')
ax.set_ylabel('Daily Food Service Sales (USD)')
ax.legend()
# Tighten the layout of the figure
plt.tight_layout()
# Display the figure
plt.show()
Daily food service sales have increased at all four sites over the past 7 days.
The growth rate is highest at Site 23415, and lowest at Site 22120.
The variation in growth rate could be due to a number of factors, such as location, target market, and season.
Overall, the graph shows that food service sales are experiencing positive growth.
# Calculate inside sales growth rates per site, so shifts never cross store boundaries
inside_by_site = time_series_data_filtered.groupby('site_id_msba')['daily_yoy_ndt.total_inside_sales']
time_series_data_filtered['growth_rate_7d'] = inside_by_site.pct_change(7)
time_series_data_filtered['growth_rate_30d'] = inside_by_site.pct_change(30)
time_series_data_filtered['growth_rate_6m'] = inside_by_site.pct_change(180)
time_series_data_filtered['growth_rate_1y'] = inside_by_site.pct_change(365)
# Set a consistent style for all charts
sns.set_style('whitegrid')
# Create a figure with adjusted size
fig = plt.figure(figsize=(14, 10))
# Define a set of colors to cycle through
colors = ['blue', 'green', 'orange', 'red']
# Create a subplot for each time period
for i, time_period in enumerate(['7 days', '30 days', '6 months', '1 year']):
ax = fig.add_subplot(2, 2, i + 1)
# Plot the data for each site ID with a different color
for j, site_id in enumerate(top_4_site_ids):
site_data = time_series_data_filtered[time_series_data_filtered['site_id_msba'] == site_id]
ax.plot(site_data.index, site_data['daily_yoy_ndt.total_inside_sales'], label=f'Site {site_id}', color=colors[j % len(colors)]) # Use modulo to cycle through colors
# Add a grid to the chart
ax.grid(True)
# Set the axis limits so that all data points are visible
ax.set_xlim(time_series_data_filtered.index.min(), time_series_data_filtered.index.max())
ax.set_ylim(0, time_series_data_filtered['daily_yoy_ndt.total_inside_sales'].max()) # Adjust y-axis limit
# Add a title and labels to the chart
ax.set_title(f'Daily Inside Sales for {time_period}')
ax.set_xlabel('Date')
ax.set_ylabel('Daily Inside Sales (USD)')
ax.legend()
# Tighten the layout of the figure
plt.tight_layout()
# Display the figure
plt.show()
The growth rate is highest in the early time periods, and decreases over time.
The growth rate is most similar for Site 21560 and Site 23415, while Site 23730 and Site 22120 have lower growth rates.
The growth rate is most consistent for Site 23415, while Site 22120, Site 21560, and Site 23730 have some fluctuations.
Overall, the graph provides a positive outlook for the inside sales business.
# Calculate diesel growth rates per site, so shifts never cross store boundaries
diesel_by_site = time_series_data_filtered.groupby('site_id_msba')['diesel']
time_series_data_filtered['growth_rate_7d'] = diesel_by_site.pct_change(7)
time_series_data_filtered['growth_rate_30d'] = diesel_by_site.pct_change(30)
time_series_data_filtered['growth_rate_6m'] = diesel_by_site.pct_change(180)
time_series_data_filtered['growth_rate_1y'] = diesel_by_site.pct_change(365)
# Set a consistent style for all charts
sns.set_style('whitegrid')
# Create a figure with adjusted size
fig = plt.figure(figsize=(14, 10))
# Define a set of colors to cycle through
colors = ['blue', 'green', 'orange', 'red']
# Create a subplot for each time period
for i, time_period in enumerate(['7 days', '30 days', '6 months', '1 year']):
ax = fig.add_subplot(2, 2, i + 1)
# Plot the data for each site ID with a different color
for j, site_id in enumerate(top_4_site_ids):
site_data = time_series_data_filtered[time_series_data_filtered['site_id_msba'] == site_id]
ax.plot(site_data.index, site_data['diesel'], label=f'Site {site_id}', color=colors[j % len(colors)]) # Use modulo to cycle through colors
# Add a grid to the chart
ax.grid(True)
# Set the axis limits so that all data points are visible
ax.set_xlim(time_series_data_filtered.index.min(), time_series_data_filtered.index.max())
ax.set_ylim(0, time_series_data_filtered['diesel'].max()) # Adjust y-axis limit
# Add a title and labels to the chart
ax.set_title(f'Daily Diesel Sales (Gallons) for {time_period}')
ax.set_xlabel('Date')
ax.set_ylabel('Daily Diesel Sales (Gallons)')
ax.legend()
# Tighten the layout of the figure
plt.tight_layout()
# Display the figure
plt.show()
The decline in daily diesel sales is the most pronounced among the four sales categories.
The decline in daily diesel sales is more consistent at Site 23415 and Site 21560 than at the other sites.
Possible explanations include a decrease in demand for diesel fuel due to the transition to renewable energy sources, the increasing popularity of electric vehicles, or seasonal trends.
# Calculate unleaded growth rates per site, so shifts never cross store boundaries
unleaded_by_site = time_series_data_filtered.groupby('site_id_msba')['unleaded']
time_series_data_filtered['growth_rate_7d'] = unleaded_by_site.pct_change(7)
time_series_data_filtered['growth_rate_30d'] = unleaded_by_site.pct_change(30)
time_series_data_filtered['growth_rate_6m'] = unleaded_by_site.pct_change(180)
time_series_data_filtered['growth_rate_1y'] = unleaded_by_site.pct_change(365)
# Set a consistent style for all charts
sns.set_style('whitegrid')
# Create a figure with adjusted size
fig = plt.figure(figsize=(14, 10))
# Define a set of colors to cycle through
colors = ['blue', 'green', 'orange', 'red']
# Create a subplot for each time period
for i, time_period in enumerate(['7 days', '30 days', '6 months', '1 year']):
ax = fig.add_subplot(2, 2, i + 1)
# Plot the data for each site ID with a different color
for j, site_id in enumerate(top_4_site_ids):
site_data = time_series_data_filtered[time_series_data_filtered['site_id_msba'] == site_id]
ax.plot(site_data.index, site_data['unleaded'], label=f'Site {site_id}', color=colors[j % len(colors)]) # Use modulo to cycle through colors
# Add a grid to the chart
ax.grid(True)
# Set the axis limits so that all data points are visible
ax.set_xlim(time_series_data_filtered.index.min(), time_series_data_filtered.index.max())
ax.set_ylim(0, time_series_data_filtered['unleaded'].max()) # Adjust y-axis limit
# Add a title and labels to the chart
ax.set_title(f'Daily Unleaded Sales (Gallons) for {time_period}')
ax.set_xlabel('Date')
ax.set_ylabel('Daily Unleaded Sales (Gallons)')
ax.legend()
# Tighten the layout of the figure
plt.tight_layout()
# Display the figure
plt.show()
All four sites have experienced a decline in daily unleaded sales.
There is some variation in the decline rate between the different site IDs.
# Total activity for each row by summing the sales columns
# (note: this mixes dollar sales with fuel gallons, so it serves only as a rough activity index)
timeseries['total_value'] = timeseries[['daily_yoy_ndt.total_inside_sales', 'daily_yoy_ndt.total_food_service', 'diesel', 'unleaded']].sum(axis=1)
# Grouping the data by the unique values in the 'site_id_msba' column
grouped = timeseries.groupby('site_id_msba')['total_value']
# Calculation of the average total sales for each unique value
average_values = grouped.mean()
# Sorting the unique values based on average total sales in descending order
sorted_values = average_values.sort_values(ascending=False)
# Calculate the Z-scores for each data point
z_scores = np.abs(stats.zscore(sorted_values))
# Set a Z-score threshold for highlighting outliers (e.g., Z-score > 2)
z_score_threshold = 2
# Identifying outliers
outliers = z_scores > z_score_threshold
# Scatter plot of the average total sales
plt.figure(figsize=(10, 6))
# Highlighting outliers in red and labeling them with site_id_msba values
for site_id, avg_total, outlier in zip(sorted_values.index, sorted_values.values, outliers):
if outlier:
plt.scatter(site_id, avg_total, marker='o', s=50, alpha=0.5, color='red')
plt.text(site_id, avg_total, site_id, fontsize=10, ha='left', va='top', color='red')
else:
plt.scatter(site_id, avg_total, marker='o', s=50, alpha=0.5, color='blue')
plt.xlabel('Unique Values (site_id_msba)')
plt.ylabel('Average Total Values')
plt.title('Scatter Plot of Average Total Values by Site_IDs')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
It can be observed that store 21980 was identified as an outlier in terms of average total values among all 38 stores. One of the target variables was responsible for this, as can be seen in the following scatter plots.
# Grouping the data by the unique values in the 'site_id_msba' column
grouped = timeseries.groupby('site_id_msba')[['daily_yoy_ndt.total_inside_sales', 'daily_yoy_ndt.total_food_service', 'diesel', 'unleaded']].mean()
# Calculate the Z-scores for each data point
z_scores = np.abs(stats.zscore(grouped))
# Set a Z-score threshold for highlighting outliers (e.g., Z-score > 2)
z_score_threshold = 2
# Identifying outliers
outliers = z_scores > z_score_threshold
# Create two subplots side by side
fig, axes = plt.subplots(1, 2, figsize=(15, 6))
# Plot for Inside Sales and Food Service
ax1 = axes[0]
sales_labels = ['daily_yoy_ndt.total_inside_sales', 'daily_yoy_ndt.total_food_service']
colors1 = ['grey', 'blue']
# Plot average inside sales and food service by site
for i, label in enumerate(sales_labels):
ax1.scatter(grouped.index, grouped[label], marker='o', s=50, alpha=0.5, color=colors1[i], label=label)
ax1.set_xlabel('Unique Values (site_id_msba)')
ax1.set_ylabel('Average Sales')
ax1.set_title('Scatter Plot of Inside Sales and Food Service by Site_IDs')
ax1.legend()
ax1.grid(True, linestyle='--', alpha=0.6)
# Plot for Diesel and Unleaded
ax2 = axes[1]
gallons_labels = ['diesel', 'unleaded']
colors2 = ['green', 'orange']
# Plot average gallons sold by site
for i, label in enumerate(gallons_labels):
ax2.scatter(grouped.index, grouped[label], marker='o', s=50, alpha=0.5, color=colors2[i], label=label)
ax2.set_xlabel('Unique Values (site_id_msba)')
ax2.set_ylabel('Average Gallons Sold')
ax2.set_title('Scatter Plot of Average Gallons Sold by Site_IDs')
ax2.legend()
ax2.grid(True, linestyle='--', alpha=0.6)
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
In the two scatter plots shown above, an outlier in the diesel category had a significant impact on store 21980, causing it to be an outlier in terms of total value.
import matplotlib.pyplot as plt
# Create a line plot for multiple sales columns
plt.figure(figsize=(10, 6))
plt.plot(timeseries.index, timeseries['diesel'], label='Diesel', marker='o')
plt.plot(timeseries.index, timeseries['unleaded'], label='Unleaded', marker='o')
# Set the title and axis labels
plt.title('Sales Over Time')
plt.xlabel('Date')
plt.ylabel('Sales')
# Add a legend
plt.legend()
# Show the plot
plt.grid(True)
plt.tight_layout()
plt.show()
import matplotlib.pyplot as plt
# Create a line plot for multiple sales columns
plt.figure(figsize=(10, 6))
plt.plot(timeseries.index, timeseries['daily_yoy_ndt.total_inside_sales'], label='Inside Sales', marker='o')
plt.plot(timeseries.index, timeseries['daily_yoy_ndt.total_food_service'], label='Food Service', marker='o')
# Set the title and axis labels
plt.title('Sales Over Time')
plt.xlabel('Date')
plt.ylabel('Sales')
# Add a legend
plt.legend()
# Show the plot
plt.grid(True)
plt.tight_layout()
plt.show()
The line plots above depict the trends in the target sales.
Notably, Diesel sales experienced a significant surge in 2022, followed by a drastic drop in 2023. The underlying reasons for this fluctuation require further investigation.
Similarly, Unleaded sales exhibited a gradual improvement from January 2021 to March 2022, followed by a gradual decline extending until July 2023. The factors contributing to this pattern should be explored to gain a deeper understanding of the dynamics at play.
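One way to make these long-run movements easier to read is to resample the daily series (pooled across all sites) to monthly means; a minimal sketch:
# Hedged sketch: monthly averages make the 2022 diesel surge and the
# 2022-2023 unleaded decline easier to see than the raw daily points
monthly = timeseries[['diesel', 'unleaded']].resample('M').mean()
monthly.plot(figsize=(10, 4), title='Monthly Average Fuel Gallons Sold (All Sites)')
plt.show()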
import matplotlib.pyplot as plt
# Calculate weekly average diesel sales for each fiscal week across all years
daily_avg_sales = timeseries.groupby(['Year', 'calendar.fiscal_week_id_for_year'])['diesel'].mean().reset_index()
# Create a line plot of weekly average diesel sales
plt.figure(figsize=(10, 6))
for year in daily_avg_sales['Year'].unique():
data = daily_avg_sales[daily_avg_sales['Year'] == year]
plt.plot(data['calendar.fiscal_week_id_for_year'], data['diesel'], label=f'Year {year}', marker='o')
# Set the title and axis labels
plt.title('Weekly Average Diesel Sales for All Site IDs (Three Years)')
plt.xlabel('Week of Year')
plt.ylabel('Average Sales')
# Add a legend
plt.legend()
# Show the plot
plt.grid(True)
plt.tight_layout()
plt.show()
# Calculate weekly average unleaded sales for each fiscal week across all years
daily_avg_sales = timeseries.groupby(['Year', 'calendar.fiscal_week_id_for_year'])['unleaded'].mean().reset_index()
# Create a line plot of weekly average unleaded sales
plt.figure(figsize=(10, 6))
for year in daily_avg_sales['Year'].unique():
data = daily_avg_sales[daily_avg_sales['Year'] == year]
plt.plot(data['calendar.fiscal_week_id_for_year'], data['unleaded'], label=f'Year {year}', marker='o')
# Set the title and axis labels
plt.title('Weekly Average Unleaded Sales for All Site IDs (Three Years)')
plt.xlabel('Week of Year')
plt.ylabel('Average Sales')
# Add a legend
plt.legend()
# Show the plot
plt.grid(True)
plt.tight_layout()
plt.show()
# Calculate weekly average inside sales for each fiscal week across all years
daily_avg_sales = timeseries.groupby(['Year', 'calendar.fiscal_week_id_for_year'])['daily_yoy_ndt.total_inside_sales'].mean().reset_index()
# Create a line plot of weekly average inside sales
plt.figure(figsize=(10, 6))
for year in daily_avg_sales['Year'].unique():
data = daily_avg_sales[daily_avg_sales['Year'] == year]
plt.plot(data['calendar.fiscal_week_id_for_year'], data['daily_yoy_ndt.total_inside_sales'], label=f'Year {year}', marker='o')
# Set the title and axis labels
plt.title('Weekly Average Inside Sales for All Site IDs (Three Years)')
plt.xlabel('Week of Year')
plt.ylabel('Average Sales')
# Add a legend
plt.legend()
# Show the plot
plt.grid(True)
plt.tight_layout()
plt.show()
# Calculate weekly average food service sales for each fiscal week across all years
daily_avg_sales = timeseries.groupby(['Year', 'calendar.fiscal_week_id_for_year'])['daily_yoy_ndt.total_food_service'].mean().reset_index()
# Create a line plot of weekly average food service sales
plt.figure(figsize=(10, 6))
for year in daily_avg_sales['Year'].unique():
data = daily_avg_sales[daily_avg_sales['Year'] == year]
plt.plot(data['calendar.fiscal_week_id_for_year'], data['daily_yoy_ndt.total_food_service'], label=f'Year {year}', marker='o')
# Set the title and axis labels
plt.title('Weekly Average Food sales for All Site IDs (Three Years)')
plt.xlabel('Week of Year')
plt.ylabel('Average Sales')
# Add a legend
plt.legend()
# Show the plot
plt.grid(True)
plt.tight_layout()
plt.show()
The four charts represent the weekly average of sales over a span of 52 weeks for three different years across various site locations. A common trend observed in all the charts, except for food services, is a gradual increase in sales from week 1 to week 26, followed by a subsequent decline.
Diesel Sales:
Unleaded Sales:
Food Sales:
Inside Sales:
In summary, while the general trend across the years for all sales categories includes a gradual increase followed by a decline, the anomalies observed in 2023, such as sudden spikes and drops, demand a deeper analysis. Identifying the reasons behind these deviations is vital for making informed decisions, adjusting strategies, and responding effectively to changing market dynamics.
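As a starting point for that deeper analysis, weekly averages that deviate sharply from their year's own mean can be flagged automatically; the sketch below uses a 2-sigma rule, where both the diesel focus and the threshold are assumptions for illustration.
# Hedged sketch: flag weekly diesel averages more than 2 standard deviations
# from that year's mean as candidate anomalies for follow-up
weekly = timeseries.groupby(['Year', 'calendar.fiscal_week_id_for_year'])['diesel'].mean()
z = (weekly - weekly.groupby('Year').transform('mean')) / weekly.groupby('Year').transform('std')
print(weekly[z.abs() > 2])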
avg_sale_day = timeseries.groupby('calendar.day_of_week').agg(
avg_inside_sales =("daily_yoy_ndt.total_inside_sales", "mean"),
avg_food_service =("daily_yoy_ndt.total_food_service", "mean"),
avg_g_diesel =("diesel", "mean"),
avg_g_unleaded =("unleaded", "mean"))
print(avg_sale_day)
ordered_days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
avg_sale_day = avg_sale_day.reindex(ordered_days)
# Get the day of week labels
avg_sale_day_labels = avg_sale_day.index.to_list()
plt.figure(figsize=(12,6))
plt.bar(avg_sale_day_labels, avg_sale_day['avg_inside_sales'], label='Inside Sales')
plt.bar(avg_sale_day_labels, avg_sale_day['avg_g_diesel'], bottom=avg_sale_day['avg_inside_sales'], label='Diesel')
plt.bar(avg_sale_day_labels, avg_sale_day['avg_g_unleaded'], bottom=avg_sale_day['avg_g_diesel'] + avg_sale_day['avg_inside_sales'], label='Unleaded')
plt.bar(avg_sale_day_labels, avg_sale_day['avg_food_service'], bottom=avg_sale_day['avg_g_unleaded'] + avg_sale_day['avg_g_diesel'] + avg_sale_day['avg_inside_sales'], label='Food Service')
#plot clean up
plt.xlabel('Day of Week')
plt.ylabel('Average of Sales')
plt.title('Average of Sales values by Day of Week')
plt.legend(loc='upper right')
plt.show()
                      avg_inside_sales  avg_food_service  avg_g_diesel  avg_g_unleaded
calendar.day_of_week
Friday                     3366.582282        882.491871   1845.983033     2794.113315
Monday                     2780.556207        777.692956   1894.728744     2383.800690
Saturday                   2756.193378        640.616270   1147.532033     2363.678907
Sunday                     2247.810004        509.612523   1009.676778     2033.416496
Thursday                   3066.268194        867.351256   1986.431140     2471.164486
Tuesday                    2802.265570        806.664158   1987.379094     2294.119326
Wednesday                  2905.310778        832.643233   2034.600464     2336.505549
This table shows the average inside sales, food service sales, and fuel gallons sold for each day of the week. Average inside sales and food service sales are highest on Fridays and lowest on Sundays. Average diesel gallons peak on Wednesdays and average unleaded gallons peak on Fridays; both are lowest on Sundays.
# Melt the DataFrame to long format for easier plotting
melted_df = timeseries.melt(id_vars=['calendar_information.type_of_day', 'calendar.day_of_week'],
value_vars=['daily_yoy_ndt.total_inside_sales', 'daily_yoy_ndt.total_food_service', 'diesel', 'unleaded'],
var_name='SalesType', value_name='SalesValue')
# Create separate plots for type_of_day
g = sns.catplot(data=melted_df, x='SalesType', y='SalesValue',hue='calendar_information.type_of_day', kind='bar', ci=None, height=6, aspect=2)
g.set_axis_labels('Sales Type', 'Average Sales')
g.set(title='Average Sales by Sales Type (Grouped by Type of Day)')
plt.xticks(rotation=45)
# plt.legend(title='Type of Day', title_fontsize='14')
plt.show()
The plot above illustrates the average sales for weekdays and weekends across all target sales. Notably, average weekday sales consistently outperform weekend sales for every target; diesel in particular sells nearly twice as many gallons on weekdays as on weekends.
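The sketch below computes the weekday-to-weekend ratio for each target to check this observation directly.
# Hedged sketch: weekday vs. weekend means for each target, and their ratio
day_type_means = timeseries.groupby('calendar_information.type_of_day')[
    ['daily_yoy_ndt.total_inside_sales', 'daily_yoy_ndt.total_food_service', 'diesel', 'unleaded']].mean()
print(day_type_means)
print(day_type_means.loc['WEEKDAY'] / day_type_means.loc['WEEKEND'])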
# List of target variables
target_var = ['daily_yoy_ndt.total_inside_sales', 'daily_yoy_ndt.total_food_service', 'diesel', 'unleaded']
tallest_clusters = {}
plt.figure(figsize=(12, 8))
for i, target_variable in enumerate(target_var):
plt.subplot(2, 2, i + 1)
ax = plt.gca()
# Grouping the data by season and calculate the mean sales for each season
grouped_data = timeseries.groupby(['season'])[target_variable].mean()
# Finding the highest and lowest values
highest_season = grouped_data.idxmax()
lowest_season = grouped_data.idxmin()
tallest_clusters[target_variable] = highest_season
    # Creating a categorical color map for the bars; the highest and lowest seasons are recolored below
cmap = plt.cm.get_cmap('viridis', len(grouped_data))
colors = cmap(range(len(grouped_data)))
# Creating a bar plot with custom colors
x = np.arange(len(grouped_data))
bars = plt.bar(x, grouped_data, color=colors)
# Coloring the bars for highest and lowest values
bars[x[grouped_data.index == highest_season][0]].set_color('grey')
bars[x[grouped_data.index == lowest_season][0]].set_color('maroon')
for bar, value in zip(bars, grouped_data):
plt.text(bar.get_x() + bar.get_width()/2, value + 0.02, f'{value:.2f}', ha='center', va='bottom', fontsize=9)
plt.title(target_variable)
plt.xlabel('Season')
plt.ylabel('Mean Sales')
plt.xticks(x, grouped_data.index)
plt.tight_layout()
plt.show()
The season plot above shows that the target sales are fairly evenly distributed across seasons; the share computation after this list makes the split explicit.
Lowest: Winter. Sales are consistently lowest in winter for inside sales, food service, diesel, and unleaded, at roughly 21-23% of the total.
Spring: inside and unleaded sales are slightly higher than in fall.
Fall: food service and diesel sales are slightly higher than in spring.
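# Hedged sketch: each season's percentage share of total sales per target
season_totals = timeseries.groupby('season')[
    ['daily_yoy_ndt.total_inside_sales', 'daily_yoy_ndt.total_food_service', 'diesel', 'unleaded']].sum()
print(season_totals / season_totals.sum() * 100)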
timeseries[['Month', 'diesel', 'unleaded', 'daily_yoy_ndt.total_food_service', 'daily_yoy_ndt.total_inside_sales']].groupby('Month').agg(
    {'diesel': ['min', 'mean', 'max'],
     'unleaded': ['min', 'mean', 'max'],
     'daily_yoy_ndt.total_inside_sales': ['min', 'mean', 'max'],
     'daily_yoy_ndt.total_food_service': ['min', 'mean', 'max']})
| Month | diesel min | diesel mean | diesel max | unleaded min | unleaded mean | unleaded max | inside sales min | inside sales mean | inside sales max | food service min | food service mean | food service max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.0000 | 1421.897759 | 8995.5320 | 270.7425 | 2126.609767 | 7522.7355 | 621.9675 | 2367.423785 | 6127.3450 | 69.3560 | 642.086242 | 2109.1910 |
| 2 | 38.5105 | 1532.424074 | 10491.4705 | 469.7455 | 2253.742180 | 8077.2335 | 0.0000 | 2536.521936 | 5742.5970 | 0.0000 | 679.738076 | 1997.9330 |
| 3 | 34.7235 | 1641.627868 | 16807.7070 | 460.5195 | 2350.197651 | 7541.0895 | 1.4350 | 2770.487707 | 6218.7195 | 0.0000 | 740.925361 | 2036.1390 |
| 4 | 12.4985 | 1761.140178 | 17117.4185 | 530.0855 | 2495.593810 | 7046.4800 | 1002.6240 | 2944.823748 | 6367.5115 | 195.8215 | 781.146362 | 2010.1165 |
| 5 | 28.9415 | 1698.356854 | 19405.7955 | 278.5370 | 2490.208546 | 6897.5375 | 165.7285 | 3003.684884 | 6579.7865 | 32.9770 | 793.212617 | 1939.4165 |
| 6 | 54.6175 | 1956.197743 | 20853.9520 | 501.5675 | 2466.486906 | 6415.0415 | 870.6390 | 3147.579068 | 6756.1165 | 100.7930 | 829.796472 | 2145.4475 |
| 7 | 20.0830 | 1823.119042 | 19618.8055 | 468.8670 | 2491.195441 | 6631.9540 | 903.6545 | 3153.351984 | 6794.9490 | 120.4455 | 811.187591 | 2133.5020 |
| 8 | 32.5605 | 1815.885205 | 18172.1680 | 398.4715 | 2465.055987 | 6570.6935 | 767.9350 | 3024.097328 | 6953.0090 | 142.8700 | 806.821880 | 2208.2690 |
| 9 | 31.0625 | 1905.036392 | 17041.3565 | 300.0410 | 2477.374024 | 5825.3090 | 449.0150 | 3065.557971 | 6954.1990 | 92.2635 | 817.643461 | 2294.6805 |
| 10 | 51.6775 | 1792.115625 | 15912.2075 | 297.1430 | 2368.184671 | 5191.1860 | 555.7405 | 2917.805767 | 6872.1520 | 174.7760 | 794.607001 | 2331.8575 |
| 11 | 15.5680 | 1706.428874 | 18434.0625 | 366.7685 | 2340.565521 | 6128.2795 | 16.7545 | 2669.817325 | 6755.1715 | 1.1795 | 734.535376 | 2531.6620 |
| 12 | 14.2625 | 1376.431557 | 18136.0970 | 240.1805 | 2254.662599 | 6359.6680 | 524.1810 | 2539.679306 | 7172.4660 | 93.2890 | 683.190961 | 2506.9485 |
Some key observations from the data include:
Diesel
Unleaded
Food Service
Inside Sales
Overall, the data suggests that diesel and food service sales peak in June; diesel is lowest in December, and food service is lowest in January. Unleaded peaks in April and inside sales in July, while January is the lowest month for both. This suggests that sales vary more in the warmer months than in winter.
avg_sale_holiday = timeseries.groupby('calendar_information.holiday').agg(
holiday_inside_sales=("daily_yoy_ndt.total_inside_sales", "mean"),
holiday_diesel_gallons=("diesel", "mean"),
holiday_gallons_unleaded=("unleaded", "mean"),
holiday_food_service=("daily_yoy_ndt.total_food_service", "mean")
)
print(avg_sale_holiday)
# Get the holiday labels
holiday_labels = avg_sale_holiday.index.to_list()
# Create the bar chart
plt.figure(figsize=(20,6))
plt.bar(holiday_labels, avg_sale_holiday['holiday_inside_sales'], label='Inside Sales', color='red')
plt.bar(holiday_labels, avg_sale_holiday['holiday_diesel_gallons'], bottom=avg_sale_holiday['holiday_inside_sales'], label='Diesel', color='blue')
plt.bar(holiday_labels, avg_sale_holiday['holiday_gallons_unleaded'], bottom=avg_sale_holiday['holiday_diesel_gallons'] + avg_sale_holiday['holiday_inside_sales'], label='Unleaded', color='green')
plt.bar(holiday_labels, avg_sale_holiday['holiday_food_service'], bottom=avg_sale_holiday['holiday_gallons_unleaded'] + avg_sale_holiday['holiday_diesel_gallons'] + avg_sale_holiday['holiday_inside_sales'], label='Food Service', color='yellow')
# Add labels and title to the plot
plt.xlabel('Holiday')
plt.ylabel('Average Sales')
plt.title('Average Sales by Holiday')
# Show the plot
plt.legend(loc='upper center')
plt.show()
                              holiday_inside_sales  holiday_diesel_gallons  holiday_gallons_unleaded  holiday_food_service
calendar_information.holiday
All Saint's Day                        2763.397461             2013.822355               2352.558526            812.893697
Ascension                              3330.209200             1974.918100               2643.381300            918.647500
Christmas Day                          2578.134171              313.806224               1225.588553            498.344829
Christmas Eve                          2360.981276              571.188303               1879.125039            471.516684
Columbus Day                           2934.279711             2076.533237               2432.096118            831.205974
Easter                                 2279.016203              999.590068               2132.158946            487.723676
Father's Day                           2495.160592             1223.395803               2082.782579            547.386829
Flag Day                               3122.260526             2215.815808               2385.664077            878.658397
Good Friday                            3385.759149             1782.902459               2975.760162            881.056041
Halloween Day                          2345.099658             1421.973184               2008.533947            613.975250
Independence Day                       2638.234605              940.666447               1947.723197            567.801592
Labor Day                              2662.100645             1280.359776               2470.882934            653.846605
Lincoln's Birthday                     2341.855158              980.630553               2076.331066            531.694855
Martin Luther King Day                 2305.821276             1524.188105               2169.764579            648.859658
Memorial Day                           2543.034421             1141.945092               2470.153184            616.893697
Mother's Day                           2311.711027              951.893297               2222.622878            500.177622
NONE                                   2865.170715             1729.674161               2397.596116            768.307761
New Year's Day                         1724.343803              550.248724               1465.395197            395.633553
New Year's Eve                         2431.326763              808.978303               1998.629039            540.002197
Palm Sunday                            2380.770681             1045.451556               2185.827875            531.065986
President's Day                        2557.950039             1674.773947               2435.921895            704.166197
Saint Patrick's Day                    3073.210368             1779.231908               2430.903355            856.535842
Saint Valentine's Day                  2360.696303             1661.073750               2187.749237            639.027053
Thanksgiving Day                       2531.865553              633.138118               1535.743355            467.995592
Veteran's Day                          2968.978895             2020.961618               2521.700539            855.537237
Washington's Birthday                  2371.655447             1672.093776               2040.445750            672.983776
This result will help us determine how to weight sales on holidays versus regular days; one simple weighting scheme is sketched below.
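The scheme expresses each holiday's average as a ratio to the non-holiday ('NONE') baseline; treating that ratio as a sales weight is an assumption for illustration.
# Hedged sketch: holiday uplift factors relative to the non-holiday baseline
baseline = avg_sale_holiday.loc['NONE']
uplift = avg_sale_holiday.div(baseline)
print(uplift.sort_values('holiday_inside_sales', ascending=False))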
store_features.info()
display(store_features.head())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 54 columns):
 #   Column                                   Non-Null Count  Dtype
---  ------                                   --------------  -----
 0   open_year                                37 non-null     int64
 1   square_feet                              37 non-null     int64
 2   front_door_count                         37 non-null     int64
 3   years_since_last_project                 37 non-null     int64
 4   parking_spaces                           37 non-null     int64
 5   lottery                                  37 non-null     object
 6   freal                                    37 non-null     object
 7   bonfire_grill                            37 non-null     object
 8   pizza                                    37 non-null     object
 9   cinnabon                                 37 non-null     object
 10  godfather_s_pizza                        37 non-null     object
 11  ethanol_free                             37 non-null     object
 12  diesel                                   37 non-null     object
 13  hi_flow_lanes                            37 non-null     object
 14  rv_lanes                                 37 non-null     object
 15  hi_flow_rv_lanes                         37 non-null     object
 16  def                                      37 non-null     object
 17  cat_scales                               37 non-null     object
 18  car_wash                                 37 non-null     object
 19  ev_charging                              37 non-null     object
 20  rv_dumps                                 37 non-null     object
 21  propane                                  37 non-null     object
 22  x1_mile_pop                              37 non-null     int64
 23  x1_mile_emp                              37 non-null     int64
 24  x1_mile_income                           37 non-null     int64
 25  x1_2_mile_pop                            37 non-null     int64
 26  x1_2_mile_emp                            37 non-null     int64
 27  x1_2_mile_income                         37 non-null     int64
 28  x5_min_pop                               37 non-null     int64
 29  x5_min_emp                               37 non-null     int64
 30  x5_min_inc                               37 non-null     int64
 31  x7_min_pop                               37 non-null     int64
 32  x7_min_emp                               37 non-null     int64
 33  x7_min_inc                               37 non-null     int64
 34  traditional_forecourt_fueling_positions  37 non-null     int64
 35  traditional_forecourt_layout             37 non-null     object
 36  traditional_forecourt_stack_type         37 non-null     object
 37  rv_lanes_fueling_positions               37 non-null     int64
 38  rv_lanes_layout                          23 non-null     object
 39  rv_lanes_stack_type                      23 non-null     object
 40  hi_flow_lanes_fueling_positions          37 non-null     int64
 41  hi_flow_lanes_layout                     22 non-null     object
 42  hi_flow_lanes_stack_type                 22 non-null     object
 43  hi_flow_lanes_fueling_positions_2        37 non-null     int64
 44  rv_lanes_fueling_positions_2             37 non-null     int64
 45  hi_flow_rv_lanes_layout                  23 non-null     object
 46  hi_flow_rv_lanes_stack_type              23 non-null     object
 47  non_24_hour                              37 non-null     object
 48  self_check_out                           37 non-null     object
 49  mens_toilet_count                        37 non-null     int64
 50  mens_urinal_count                        37 non-null     int64
 51  womens_toilet_count                      37 non-null     int64
 52  womens_sink_count                        37 non-null     int64
 53  site_id_msba                             37 non-null     int64
dtypes: int64(27), object(27)
memory usage: 15.7+ KB
| open_year | square_feet | front_door_count | years_since_last_project | parking_spaces | lottery | freal | bonfire_grill | pizza | cinnabon | ... | rv_lanes_fueling_positions_2 | hi_flow_rv_lanes_layout | hi_flow_rv_lanes_stack_type | non_24_hour | self_check_out | mens_toilet_count | mens_urinal_count | womens_toilet_count | womens_sink_count | site_id_msba | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021 | 5046 | 2 | 2 | 38 | Yes | Yes | Yes | No | No | ... | 6 | Stack | HF/RV | No | Yes | 2 | 2 | 6 | 2 | 21560 |
| 1 | 2021 | 5046 | 2 | 2 | 39 | No | Yes | Yes | Yes | No | ... | 4 | Combo | HF/RV | No | Yes | 5 | 5 | 10 | 4 | 21980 |
| 2 | 2021 | 5046 | 2 | 2 | 35 | Yes | Yes | Yes | Yes | No | ... | 5 | In-Line | None | No | Yes | 3 | 2 | 4 | 1 | 22015 |
| 3 | 2021 | 5046 | 2 | 2 | 36 | No | Yes | Yes | Yes | No | ... | 4 | Combo | HF/RV | No | Yes | 3 | 3 | 6 | 2 | 22085 |
| 4 | 2021 | 5046 | 2 | 2 | 25 | Yes | Yes | Yes | No | No | ... | 0 | NaN | NaN | No | Yes | 0 | 0 | 0 | 0 | 22120 |
5 rows × 54 columns
Starting position of the store_features dataset: 55 columns, 37 rows; dtypes: int64(28), object(27).
We removed the first column, which was not assigned a name; it contained only serial numbers.
#dropping the unnamed index column (errors='ignore' makes re-running this cell safe)
store_features = store_features.drop(columns='Unnamed: 0', errors='ignore')
store_features.head()
#descriptive analysis of the dataset
store_features.describe()
| open_year | square_feet | front_door_count | years_since_last_project | parking_spaces | x1_mile_pop | x1_mile_emp | x1_mile_income | x1_2_mile_pop | x1_2_mile_emp | ... | traditional_forecourt_fueling_positions | rv_lanes_fueling_positions | hi_flow_lanes_fueling_positions | hi_flow_lanes_fueling_positions_2 | rv_lanes_fueling_positions_2 | mens_toilet_count | mens_urinal_count | womens_toilet_count | womens_sink_count | site_id_msba | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 37.000000 | 37.00000 | 37.0 | 37.000000 | 37.000000 | 37.000000 | 37.000000 | 37.000000 | 37.000000 | 37.000000 | ... | 37.000000 | 37.000000 | 37.000000 | 37.000000 | 37.000000 | 37.000000 | 37.000000 | 37.000000 | 37.000000 | 37.000000 |
| mean | 2021.324324 | 4970.27027 | 2.0 | 1.648649 | 37.405405 | 6703.567568 | 4757.648649 | 53300.378378 | 1833.108108 | 1514.135135 | ... | 14.270270 | 2.513514 | 3.324324 | 3.324324 | 2.513514 | 2.378378 | 2.351351 | 4.648649 | 1.702703 | 23040.405405 |
| std | 0.474579 | 575.93121 | 0.0 | 0.483978 | 5.918237 | 5694.011350 | 4697.168291 | 24333.027254 | 1915.140476 | 2489.423094 | ... | 3.948619 | 2.049683 | 2.925501 | 2.925501 | 2.049683 | 0.923500 | 0.856875 | 1.751447 | 0.740303 | 730.069801 |
| min | 2021.000000 | 2933.00000 | 2.0 | 1.000000 | 23.000000 | 0.000000 | 56.000000 | 0.000000 | 0.000000 | 31.000000 | ... | 10.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 21560.000000 |
| 25% | 2021.000000 | 5046.00000 | 2.0 | 1.000000 | 34.000000 | 1984.000000 | 1771.000000 | 39538.000000 | 262.000000 | 386.000000 | ... | 12.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 | 2.000000 | 4.000000 | 1.000000 | 22540.000000 |
| 50% | 2021.000000 | 5046.00000 | 2.0 | 2.000000 | 38.000000 | 5574.000000 | 3895.000000 | 46356.000000 | 1003.000000 | 1037.000000 | ... | 12.000000 | 4.000000 | 5.000000 | 5.000000 | 4.000000 | 2.000000 | 2.000000 | 4.000000 | 2.000000 | 22890.000000 |
| 75% | 2022.000000 | 5046.00000 | 2.0 | 2.000000 | 41.000000 | 11269.000000 | 6002.000000 | 73519.000000 | 2686.000000 | 1616.000000 | ... | 16.000000 | 4.000000 | 5.000000 | 5.000000 | 4.000000 | 3.000000 | 3.000000 | 6.000000 | 2.000000 | 23555.000000 |
| max | 2022.000000 | 6134.00000 | 2.0 | 2.000000 | 49.000000 | 18692.000000 | 26077.000000 | 110957.000000 | 5923.000000 | 15403.000000 | ... | 24.000000 | 6.000000 | 9.000000 | 9.000000 | 6.000000 | 5.000000 | 5.000000 | 10.000000 | 4.000000 | 24535.000000 |
8 rows × 27 columns
This output gives descriptive statistics for our numerical columns. Notably, every store has exactly 2 front doors, so front_door_count carries no variance; we will keep this in mind in the next steps of the analysis. The unique-value counts further below complement this view: they help us gauge the diversity of the data, flag columns with a small number of unique values for cleaning and feature engineering, and let us closely examine the categorical columns so that no values are overlooked before factorizing.
#Shows us null values
store_features.isnull().sum()
open_year 0 square_feet 0 front_door_count 0 years_since_last_project 0 parking_spaces 0 lottery 0 freal 0 bonfire_grill 0 pizza 0 cinnabon 0 godfather_s_pizza 0 ethanol_free 0 diesel 0 hi_flow_lanes 0 rv_lanes 0 hi_flow_rv_lanes 0 def 0 cat_scales 0 car_wash 0 ev_charging 0 rv_dumps 0 propane 0 x1_mile_pop 0 x1_mile_emp 0 x1_mile_income 0 x1_2_mile_pop 0 x1_2_mile_emp 0 x1_2_mile_income 0 x5_min_pop 0 x5_min_emp 0 x5_min_inc 0 x7_min_pop 0 x7_min_emp 0 x7_min_inc 0 traditional_forecourt_fueling_positions 0 traditional_forecourt_layout 0 traditional_forecourt_stack_type 0 rv_lanes_fueling_positions 0 rv_lanes_layout 14 rv_lanes_stack_type 14 hi_flow_lanes_fueling_positions 0 hi_flow_lanes_layout 15 hi_flow_lanes_stack_type 15 hi_flow_lanes_fueling_positions_2 0 rv_lanes_fueling_positions_2 0 hi_flow_rv_lanes_layout 14 hi_flow_rv_lanes_stack_type 14 non_24_hour 0 self_check_out 0 mens_toilet_count 0 mens_urinal_count 0 womens_toilet_count 0 womens_sink_count 0 site_id_msba 0 dtype: int64
Six columns contain null values, each with 14-15 missing entries. All six are layout and stack-type descriptors for the RV and hi-flow lanes, so a null most likely indicates that the store does not have that lane type. We will investigate these values and replace or feature-engineer them as needed.
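As a minimal sketch of one option, assuming a null in these columns simply means the store lacks that lane type, we could fill them with an explicit 'None' category (a value that already appears in hi_flow_rv_lanes_stack_type); we work on a copy here and defer the actual imputation:
# Sketch: fill lane layout/stack-type nulls with an explicit 'None' category
layout_cols = ['rv_lanes_layout', 'rv_lanes_stack_type',
               'hi_flow_lanes_layout', 'hi_flow_lanes_stack_type',
               'hi_flow_rv_lanes_layout', 'hi_flow_rv_lanes_stack_type']
store_features_filled = store_features.copy()
store_features_filled[layout_cols] = store_features_filled[layout_cols].fillna('None')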
store_features.nunique()
open_year 2 square_feet 9 front_door_count 1 years_since_last_project 2 parking_spaces 21 lottery 2 freal 2 bonfire_grill 2 pizza 2 cinnabon 2 godfather_s_pizza 1 ethanol_free 2 diesel 1 hi_flow_lanes 2 rv_lanes 2 hi_flow_rv_lanes 2 def 2 cat_scales 2 car_wash 1 ev_charging 1 rv_dumps 2 propane 2 x1_mile_pop 37 x1_mile_emp 37 x1_mile_income 36 x1_2_mile_pop 35 x1_2_mile_emp 37 x1_2_mile_income 33 x5_min_pop 37 x5_min_emp 37 x5_min_inc 36 x7_min_pop 37 x7_min_emp 37 x7_min_inc 37 traditional_forecourt_fueling_positions 6 traditional_forecourt_layout 2 traditional_forecourt_stack_type 3 rv_lanes_fueling_positions 5 rv_lanes_layout 2 rv_lanes_stack_type 2 hi_flow_lanes_fueling_positions 6 hi_flow_lanes_layout 2 hi_flow_lanes_stack_type 1 hi_flow_lanes_fueling_positions_2 6 rv_lanes_fueling_positions_2 5 hi_flow_rv_lanes_layout 3 hi_flow_rv_lanes_stack_type 2 non_24_hour 1 self_check_out 1 mens_toilet_count 6 mens_urinal_count 5 womens_toilet_count 6 womens_sink_count 5 site_id_msba 37 dtype: int64
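Several columns (front_door_count, godfather_s_pizza, diesel, car_wash, ev_charging, hi_flow_lanes_stack_type, non_24_hour, self_check_out) take a single value across all 37 stores and therefore carry no predictive signal. As a small sketch, such zero-variance columns can be flagged programmatically as drop candidates during cleaning:
# Flag zero-variance columns (a single unique value) as drop candidates
constant_cols = [c for c in store_features.columns if store_features[c].nunique() <= 1]
print('Constant columns:', constant_cols)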
# Step 1: Identify identical columns (each unordered pair is checked once)
identical_columns = [
    (col1, col2)
    for col1, col2 in combinations(store_features.columns, 2)
    if store_features[col1].equals(store_features[col2])
]
print(identical_columns)
#Correlation between numerical store features
corr_matrix = store_features.corr(numeric_only=True)
#Keep only correlations with absolute value >= 0.6
filtered_corr_matrix = corr_matrix[abs(corr_matrix) >= 0.6]
#correlation map
plt.figure(figsize = (20,10))
sns.heatmap(filtered_corr_matrix, annot=True)
plt.title('Correlation Heatmap')
plt.show()
This plot shows only correlations of at least 0.6 (in absolute value) among the store features. The strongest positive correlations are among population, income, and employment within a given radius of the store. Another cluster of positive correlations links fueling lanes, fueling positions, and toilet counts. Negative correlations appear between lane counts and population within certain radii. We will utilize these findings when examining the merged datasets.
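As a complement to the heatmap, a short sketch (reusing corr_matrix from above) that lists the strongest absolute pairwise correlations directly:
# Rank the strongest absolute correlations as pairs (upper triangle only)
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1))
strong_pairs = upper.stack()
print(strong_pairs[strong_pairs.abs() >= 0.6].sort_values(key=abs, ascending=False).head(10))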
#Merge dataset
bonfire = timeseries.merge(store_features, on='site_id_msba', how='outer')
#Top 5 rows of data overview
bonfire.head()
| capital_projects.soft_opening_date | calendar.fiscal_week_id_for_year | calendar.day_of_week | calendar_information.holiday | calendar_information.type_of_day | daily_yoy_ndt.total_inside_sales | daily_yoy_ndt.total_food_service | diesel_x | unleaded | site_id_msba | ... | hi_flow_lanes_fueling_positions_2 | rv_lanes_fueling_positions_2 | hi_flow_rv_lanes_layout | hi_flow_rv_lanes_stack_type | non_24_hour | self_check_out | mens_toilet_count | mens_urinal_count | womens_toilet_count | womens_sink_count | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-01-12 | 2 | Tuesday | NONE | WEEKDAY | 2036.2685 | 762.8530 | 1424.1850 | 1522.0030 | 21560 | ... | 4.0 | 6.0 | Stack | HF/RV | No | Yes | 2.0 | 2.0 | 6.0 | 2.0 |
| 1 | 2021-01-12 | 2 | Wednesday | NONE | WEEKDAY | 2379.8880 | 1003.7930 | 1303.8445 | 1853.7715 | 21560 | ... | 4.0 | 6.0 | Stack | HF/RV | No | Yes | 2.0 | 2.0 | 6.0 | 2.0 |
| 2 | 2021-01-12 | 2 | Thursday | NONE | WEEKDAY | 2435.8600 | 974.2250 | 1375.6785 | 2122.4070 | 21560 | ... | 4.0 | 6.0 | Stack | HF/RV | No | Yes | 2.0 | 2.0 | 6.0 | 2.0 |
| 3 | 2021-01-12 | 3 | Friday | NONE | WEEKDAY | 2805.9780 | 911.0115 | 1334.9175 | 2267.9930 | 21560 | ... | 4.0 | 6.0 | Stack | HF/RV | No | Yes | 2.0 | 2.0 | 6.0 | 2.0 |
| 4 | 2021-01-12 | 3 | Saturday | NONE | WEEKEND | 2314.7635 | 715.7535 | 831.1625 | 1819.6395 | 21560 | ... | 4.0 | 6.0 | Stack | HF/RV | No | Yes | 2.0 | 2.0 | 6.0 | 2.0 |
5 rows × 69 columns
bonfire.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 13908 entries, 0 to 13907 Data columns (total 69 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 capital_projects.soft_opening_date 13908 non-null datetime64[ns] 1 calendar.fiscal_week_id_for_year 13908 non-null int64 2 calendar.day_of_week 13908 non-null object 3 calendar_information.holiday 13908 non-null object 4 calendar_information.type_of_day 13908 non-null object 5 daily_yoy_ndt.total_inside_sales 13908 non-null float64 6 daily_yoy_ndt.total_food_service 13908 non-null float64 7 diesel_x 13908 non-null float64 8 unleaded 13908 non-null float64 9 site_id_msba 13908 non-null int64 10 Year 13908 non-null int64 11 Month 13908 non-null int64 12 Day 13908 non-null int64 13 DayOfYear 13908 non-null int64 14 season 13908 non-null object 15 total_value 13908 non-null float64 16 open_year 13542 non-null float64 17 square_feet 13542 non-null float64 18 front_door_count 13542 non-null float64 19 years_since_last_project 13542 non-null float64 20 parking_spaces 13542 non-null float64 21 lottery 13542 non-null object 22 freal 13542 non-null object 23 bonfire_grill 13542 non-null object 24 pizza 13542 non-null object 25 cinnabon 13542 non-null object 26 godfather_s_pizza 13542 non-null object 27 ethanol_free 13542 non-null object 28 diesel_y 13542 non-null object 29 hi_flow_lanes 13542 non-null object 30 rv_lanes 13542 non-null object 31 hi_flow_rv_lanes 13542 non-null object 32 def 13542 non-null object 33 cat_scales 13542 non-null object 34 car_wash 13542 non-null object 35 ev_charging 13542 non-null object 36 rv_dumps 13542 non-null object 37 propane 13542 non-null object 38 x1_mile_pop 13542 non-null float64 39 x1_mile_emp 13542 non-null float64 40 x1_mile_income 13542 non-null float64 41 x1_2_mile_pop 13542 non-null float64 42 x1_2_mile_emp 13542 non-null float64 43 x1_2_mile_income 13542 non-null float64 44 x5_min_pop 13542 non-null float64 45 x5_min_emp 13542 non-null float64 46 x5_min_inc 13542 non-null float64 47 x7_min_pop 13542 non-null float64 48 x7_min_emp 13542 non-null float64 49 x7_min_inc 13542 non-null float64 50 traditional_forecourt_fueling_positions 13542 non-null float64 51 traditional_forecourt_layout 13542 non-null object 52 traditional_forecourt_stack_type 13542 non-null object 53 rv_lanes_fueling_positions 13542 non-null float64 54 rv_lanes_layout 8418 non-null object 55 rv_lanes_stack_type 8418 non-null object 56 hi_flow_lanes_fueling_positions 13542 non-null float64 57 hi_flow_lanes_layout 8052 non-null object 58 hi_flow_lanes_stack_type 8052 non-null object 59 hi_flow_lanes_fueling_positions_2 13542 non-null float64 60 rv_lanes_fueling_positions_2 13542 non-null float64 61 hi_flow_rv_lanes_layout 8418 non-null object 62 hi_flow_rv_lanes_stack_type 8418 non-null object 63 non_24_hour 13542 non-null object 64 self_check_out 13542 non-null object 65 mens_toilet_count 13542 non-null float64 66 mens_urinal_count 13542 non-null float64 67 womens_toilet_count 13542 non-null float64 68 womens_sink_count 13542 non-null float64 dtypes: datetime64[ns](1), float64(31), int64(6), object(31) memory usage: 7.4+ MB
bonfire.describe()
| calendar.fiscal_week_id_for_year | daily_yoy_ndt.total_inside_sales | daily_yoy_ndt.total_food_service | diesel_x | unleaded | site_id_msba | Year | Month | Day | DayOfYear | ... | x7_min_inc | traditional_forecourt_fueling_positions | rv_lanes_fueling_positions | hi_flow_lanes_fueling_positions | hi_flow_lanes_fueling_positions_2 | rv_lanes_fueling_positions_2 | mens_toilet_count | mens_urinal_count | womens_toilet_count | womens_sink_count | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 13908.000000 | 13908.000000 | 13908.000000 | 13908.000000 | 13908.000000 | 13908.000000 | 13908.000000 | 13908.000000 | 13908.000000 | 13908.000000 | ... | 13542.000000 | 13542.000000 | 13542.000000 | 13542.000000 | 13542.000000 | 13542.000000 | 13542.000000 | 13542.000000 | 13542.000000 | 13542.000000 |
| mean | 26.501079 | 2846.537988 | 759.922326 | 1702.585227 | 2382.091588 | 23041.052632 | 2021.820032 | 6.526172 | 15.721024 | 183.004242 | ... | 59849.837838 | 14.270270 | 2.513514 | 3.324324 | 3.324324 | 2.513514 | 2.378378 | 2.351351 | 4.648649 | 1.702703 |
| std | 14.998715 | 981.299870 | 341.578220 | 2161.208192 | 1025.518658 | 710.634218 | 0.644380 | 3.446887 | 8.794284 | 105.333929 | ... | 18710.108775 | 3.895038 | 2.021870 | 2.885803 | 2.885803 | 2.021870 | 0.910969 | 0.845247 | 1.727681 | 0.730257 |
| min | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 240.180500 | 21560.000000 | 2021.000000 | 1.000000 | 1.000000 | 1.000000 | ... | 31540.000000 | 10.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 14.000000 | 2181.156250 | 521.087875 | 383.062750 | 1654.149000 | 22540.000000 | 2021.000000 | 4.000000 | 8.000000 | 92.000000 | ... | 48170.000000 | 12.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 | 2.000000 | 4.000000 | 1.000000 |
| 50% | 26.000000 | 2693.976250 | 697.434500 | 1018.920000 | 2256.677500 | 22907.500000 | 2022.000000 | 7.000000 | 16.000000 | 183.000000 | ... | 53805.000000 | 12.000000 | 4.000000 | 5.000000 | 5.000000 | 4.000000 | 2.000000 | 2.000000 | 4.000000 | 2.000000 |
| 75% | 39.000000 | 3325.306250 | 924.282625 | 2283.297625 | 2928.254000 | 23555.000000 | 2022.000000 | 10.000000 | 23.000000 | 274.000000 | ... | 77818.000000 | 16.000000 | 4.000000 | 5.000000 | 5.000000 | 4.000000 | 3.000000 | 3.000000 | 6.000000 | 2.000000 |
| max | 52.000000 | 7172.466000 | 2531.662000 | 20853.952000 | 8077.233500 | 24535.000000 | 2023.000000 | 12.000000 | 31.000000 | 365.000000 | ... | 108534.000000 | 24.000000 | 6.000000 | 9.000000 | 9.000000 | 6.000000 | 5.000000 | 5.000000 | 10.000000 | 4.000000 |
8 rows × 37 columns
bonfire_unique = bonfire.copy()
# Group by 'site_id_msba' and calculate the average for each group
bonfire_unique['avg_inside_sales'] = bonfire_unique.groupby('site_id_msba')['daily_yoy_ndt.total_inside_sales'].transform('mean')
bonfire_unique['avg_food_service'] = bonfire_unique.groupby('site_id_msba')['daily_yoy_ndt.total_food_service'].transform('mean')
bonfire_unique['avg_diesel'] = bonfire_unique.groupby('site_id_msba')['diesel_x'].transform('mean')
bonfire_unique['avg_unleaded'] = bonfire_unique.groupby('site_id_msba')['unleaded'].transform('mean')
# Columns to keep in the DataFrame
columns_to_keep = ['capital_projects.soft_opening_date', 'site_id_msba', 'avg_inside_sales', 'avg_food_service', 'avg_diesel', 'avg_unleaded']
# New DataFrame with only the specified columns
bonfire_unique = bonfire_unique[columns_to_keep].copy()
# Dropping duplicate rows based on'site_id_msba'
bonfire_unique = bonfire_unique.drop_duplicates(subset=['site_id_msba'])
# Merging bonfire_unique with store_features
merged_data = pd.merge(bonfire_unique, store_features, on='site_id_msba')
# Sorting the merged dataset by 'site_id_msba' in ascending order
merged_data = merged_data.sort_values(by='site_id_msba', ascending=True)
# Resetting the index to ensure continuous row numbering
merged_data = merged_data.reset_index(drop=True)
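For reference, the transform-then-deduplicate steps above can be condensed into a single groupby aggregation; a sketch assuming the same column names:
# Equivalent one-step aggregation: one row per site with its average sales
site_means = (bonfire.groupby('site_id_msba', as_index=False)
                     .agg(avg_inside_sales=('daily_yoy_ndt.total_inside_sales', 'mean'),
                          avg_food_service=('daily_yoy_ndt.total_food_service', 'mean'),
                          avg_diesel=('diesel_x', 'mean'),
                          avg_unleaded=('unleaded', 'mean')))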
merged_data_food = merged_data.copy()
# Convert 'Yes' and 'No' to 1 and 0 for the relevant columns
columns_to_convert = ['freal', 'bonfire_grill', 'pizza', 'cinnabon', 'godfather_s_pizza']
merged_data_food[columns_to_convert] = (merged_data_food[columns_to_convert] == 'Yes').astype(int)
# Group the data and calculate the mean of avg_food_service
grouped_data = merged_data_food.groupby(columns_to_convert)['avg_food_service'].mean().reset_index()
plt.figure(figsize=(12, 8))
# Labels for the legend
x_labels = ['freal', 'bonfire_grill', 'pizza', 'cinnabon', 'godfather_s_pizza']
combo_rows = grouped_data[x_labels].drop_duplicates()  # distinct name so we don't shadow itertools.combinations
for i, row in combo_rows.iterrows():
label = ', '.join([x_labels[j] for j in range(len(x_labels)) if row.iloc[j] == 1])
plt.bar(label, grouped_data[(grouped_data[x_labels] == row).all(axis=1)]['avg_food_service'].values[0], label=label)
# Reduce legend box size
plt.legend(loc='upper left', prop={'size': 8})
plt.title('Impact of Variable Availability on avg_food_service')
plt.xlabel('Variable Combinations')
plt.ylabel('Average avg_food_service')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
The bar plot above suggests that the presence of bonfire_grill and pizza has a positive impact on food service sales.
# Select the columns of interest
columns_to_analyze = ['freal', 'bonfire_grill', 'pizza', 'cinnabon', 'godfather_s_pizza', 'avg_food_service']
# Subset the data
subset_data = merged_data_food[columns_to_analyze]
# Calculate the correlation matrix
correlation_matrix = subset_data.corr()
# Create a heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix, annot=True, linewidths=0.5)
plt.title('Correlation Heatmap of Variables and avg_food_service')
plt.tight_layout()
plt.show()
Here, the correlation between avg_food_service and pizza is 0.58, and between avg_food_service and bonfire_grill it is 0.46. This suggests that pizza and bonfire_grill each have a fairly strong positive correlation with avg_food_service. Pizza and bonfire_grill also correlate positively with each other (0.42).
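Because these offerings are binary indicators, the Pearson values above are point-biserial correlations. With only 37 stores, statistical significance is worth checking; a small sketch using scipy (imported earlier as stats) attaches a p-value:
# Point-biserial correlation with a p-value for the binary pizza indicator
r, p = stats.pointbiserialr(merged_data_food['pizza'], merged_data_food['avg_food_service'])
print(f'pizza vs avg_food_service: r = {r:.2f}, p = {p:.3f}')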
# Demographic columns
demographics_columns = ['x1_2_mile_pop', 'x1_2_mile_income', 'x1_2_mile_emp',
'x1_mile_pop', 'x1_mile_income', 'x1_mile_emp',
'x5_min_pop', 'x5_min_inc', 'x5_min_emp',
'x7_min_pop', 'x7_min_inc', 'x7_min_emp']
# Looping through the target variables
for target_variable in ['avg_inside_sales']:
# Subplots with shared y-axes for each row
fig, axes = plt.subplots(3, 4, figsize=(12, 8), sharey=True)
fig.subplots_adjust(wspace=0.4, hspace=0.4)
for i, demo_column in enumerate(demographics_columns):
ax = axes[i // 4, i % 4] # Get the correct subplot
# Scatter plot
ax.scatter(merged_data[demo_column], merged_data[target_variable], alpha=0.5)
ax.set_xlabel(demo_column)
if i % 4 == 0:
ax.set_ylabel(target_variable)
# Calculation of the correlation coefficient
correlation_coefficient = np.corrcoef(merged_data[demo_column], merged_data[target_variable])[0, 1]
# Display correlation value inside the plot
ax.text(0.1, 0.85, f'Correlation: {correlation_coefficient:.2f}', transform=ax.transAxes, fontsize=10, color='blue')
plt.tight_layout()
plt.show()
In this analysis, we examined the relationships between income (inc), population (pop), and employment (emp) within various proximity radii (1/2 mile, 1 mile, 5 minutes, and 7 minutes) and the stores' average inside sales. The observed correlations between these demographic factors and inside sales are consistently negative, though the strongest is only -0.27. Further investigation is therefore required to gain a more comprehensive understanding of how these demographic variables may impact inside sales.
# Looping through the target variables
for target_variable in ['avg_food_service']:
# Subplots with shared y-axes for each row
fig, axes = plt.subplots(3, 4, figsize=(12, 8), sharey=True)
fig.subplots_adjust(wspace=0.4, hspace=0.4)
for i, demo_column in enumerate(demographics_columns):
ax = axes[i // 4, i % 4] # Get the correct subplot
# Scatter plot
ax.scatter(merged_data[demo_column], merged_data[target_variable], alpha=0.5)
ax.set_xlabel(demo_column)
if i % 4 == 0:
ax.set_ylabel(target_variable)
# Calculation of the correlation coefficient
correlation_coefficient = np.corrcoef(merged_data[demo_column], merged_data[target_variable])[0, 1]
# Display correlation value inside the plot
ax.text(0.1, 0.85, f'Correlation: {correlation_coefficient:.2f}', transform=ax.transAxes, fontsize=10, color='blue')
plt.tight_layout()
plt.show()
Much like inside sales, average food service sales exhibit negative correlations with the demographic factors. The weakest negative correlation observed is -0.12 (with x1_2_mile_emp) and the strongest is -0.49 (with x1_mile_pop). This indicates an inverse relationship between food service sales and nearby population, employment, and income. We will explore these relationships more thoroughly in the modeling phase.
# Looping through the target variables
for target_variable in ['avg_diesel']:
# Subplots with shared y-axes for each row
fig, axes = plt.subplots(3, 4, figsize=(12, 8), sharey=True)
fig.subplots_adjust(wspace=0.4, hspace=0.4)
for i, demo_column in enumerate(demographics_columns):
ax = axes[i // 4, i % 4] # Get the correct subplot
# Scatter plot
ax.scatter(merged_data[demo_column], merged_data[target_variable], alpha=0.5)
ax.set_xlabel(demo_column)
if i % 4 == 0:
ax.set_ylabel(target_variable)
# Calculation of the correlation coefficient
correlation_coefficient = np.corrcoef(merged_data[demo_column], merged_data[target_variable])[0, 1]
# Display correlation value inside the plot
ax.text(0.1, 0.85, f'Correlation: {correlation_coefficient:.2f}', transform=ax.transAxes, fontsize=10, color='blue')
plt.tight_layout()
plt.show()
The correlations between diesel gallons sold and the demographic factors are relatively weak in most cases. Notably, there is a positive correlation of 0.12 between average diesel sales (avg_diesel) and employment within a half-mile radius (x1_2_mile_emp). The strongest negative correlation, -0.47, is with the population within a one-mile radius (x1_mile_pop).
# Looping through the target variables
for target_variable in ['avg_unleaded']:
# Subplots with shared y-axes for each row
fig, axes = plt.subplots(3, 4, figsize=(12, 8), sharey=True)
fig.subplots_adjust(wspace=0.4, hspace=0.4)
for i, demo_column in enumerate(demographics_columns):
ax = axes[i // 4, i % 4] # Get the correct subplot
# Scatter plot
ax.scatter(merged_data[demo_column], merged_data[target_variable], alpha=0.5)
ax.set_xlabel(demo_column)
if i % 4 == 0:
ax.set_ylabel(target_variable)
# Calculation of the correlation coefficient
correlation_coefficient = np.corrcoef(merged_data[demo_column], merged_data[target_variable])[0, 1]
# Display correlation value inside the plot
ax.text(0.1, 0.85, f'Correlation: {correlation_coefficient:.2f}', transform=ax.transAxes, fontsize=10, color='blue')
plt.tight_layout()
plt.show()
The correlations between average unleaded fuel sales (avg_unleaded) and the demographic factors are relatively weak. The strongest positive correlation is 0.12, with income within a five-minute radius (x5_min_inc); the strongest negative correlation observed in this analysis is -0.29.
# Employee-to-population ratios for each radius
# (replace zero populations with NaN first; dividing by zero would produce inf)
merged_data['x1_2_mile_emp_pop_ratio'] = merged_data['x1_2_mile_emp'] / merged_data['x1_2_mile_pop'].replace(0, np.nan)
merged_data['x1_mile_emp_pop_ratio'] = merged_data['x1_mile_emp'] / merged_data['x1_mile_pop'].replace(0, np.nan)
merged_data['x5_min_emp_pop_ratio'] = merged_data['x5_min_emp'] / merged_data['x5_min_pop'].replace(0, np.nan)
merged_data['x7_min_emp_pop_ratio'] = merged_data['x7_min_emp'] / merged_data['x7_min_pop'].replace(0, np.nan)
# Columns of interest
employee_population_ratios = ['x1_2_mile_emp_pop_ratio', 'x1_mile_emp_pop_ratio',
'x5_min_emp_pop_ratio', 'x7_min_emp_pop_ratio']
# Looping through the target variables
for target_variable in ['avg_inside_sales', 'avg_food_service', 'avg_diesel', 'avg_unleaded']:
plt.figure(figsize=(12, 8))
for i, emp_pop_ratio_column in enumerate(employee_population_ratios):
plt.subplot(2, 2, i + 1)
# Scatter plot for the current employee-to-population ratio vs. the target variable
plt.scatter(merged_data[emp_pop_ratio_column], merged_data[target_variable], alpha=0.5)
plt.xlabel(emp_pop_ratio_column)
plt.ylabel(target_variable)
plt.title(f'{emp_pop_ratio_column} vs. {target_variable}')
# Linear regression statistics (restricted to finite values, since the ratio
# columns are NaN where the local population is zero)
mask = np.isfinite(merged_data[emp_pop_ratio_column]) & np.isfinite(merged_data[target_variable])
x = merged_data[emp_pop_ratio_column][mask]
y = merged_data[target_variable][mask]
slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
# Correlation direction (positive or negative)
correlation_direction = "Positive" if slope > 0 else "Negative"
# Trend line in green for positive correlation and red for negative correlation
if correlation_direction == "Positive":
color = 'green'
else:
color = 'red'
# Plotting the regression line
plt.plot(x, slope * x + intercept, color=color)
plt.tight_layout()
plt.show()
This analysis compared the employee-to-population ratio with each target variable. As the ratio increases, the target variables tend to move with it, a trend visible in the fitted lines. Food service and inside sales show notably strong positive associations with the ratio at the seven-minute radius (x7_min). Diesel sales show only a slight positive trend, and in some cases a negative one. These findings will be explored and interpreted further in the modeling phase.
from sklearn.preprocessing import LabelEncoder
object_columns = bonfire.select_dtypes(include=['object']).columns.tolist()
# Initialize LabelEncoder
label_encoder = LabelEncoder()
# Apply label encoding to object columns; astype(str) turns NaN into its own
# 'nan' category so columns with missing values encode cleanly
for col in object_columns:
bonfire[col] = label_encoder.fit_transform(bonfire[col].astype(str))
The label encoder transforms categorical variables into numerical codes, preparing the data for the correlation-based ranking of key contributors below.
bonfire.head()
| capital_projects.soft_opening_date | calendar.fiscal_week_id_for_year | calendar.day_of_week | calendar_information.holiday | calendar_information.type_of_day | daily_yoy_ndt.total_inside_sales | daily_yoy_ndt.total_food_service | diesel_x | unleaded | site_id_msba | ... | hi_flow_lanes_fueling_positions_2 | rv_lanes_fueling_positions_2 | hi_flow_rv_lanes_layout | hi_flow_rv_lanes_stack_type | non_24_hour | self_check_out | mens_toilet_count | mens_urinal_count | womens_toilet_count | womens_sink_count | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-01-12 | 2 | 5 | 16 | 0 | 2036.2685 | 762.8530 | 1424.1850 | 1522.0030 | 21560 | ... | 4.0 | 6.0 | 2 | 0 | 0 | 0 | 2.0 | 2.0 | 6.0 | 2.0 |
| 1 | 2021-01-12 | 2 | 6 | 16 | 0 | 2379.8880 | 1003.7930 | 1303.8445 | 1853.7715 | 21560 | ... | 4.0 | 6.0 | 2 | 0 | 0 | 0 | 2.0 | 2.0 | 6.0 | 2.0 |
| 2 | 2021-01-12 | 2 | 4 | 16 | 0 | 2435.8600 | 974.2250 | 1375.6785 | 2122.4070 | 21560 | ... | 4.0 | 6.0 | 2 | 0 | 0 | 0 | 2.0 | 2.0 | 6.0 | 2.0 |
| 3 | 2021-01-12 | 3 | 0 | 16 | 0 | 2805.9780 | 911.0115 | 1334.9175 | 2267.9930 | 21560 | ... | 4.0 | 6.0 | 2 | 0 | 0 | 0 | 2.0 | 2.0 | 6.0 | 2.0 |
| 4 | 2021-01-12 | 3 | 2 | 16 | 1 | 2314.7635 | 715.7535 | 831.1625 | 1819.6395 | 21560 | ... | 4.0 | 6.0 | 2 | 0 | 0 | 0 | 2.0 | 2.0 | 6.0 | 2.0 |
5 rows × 69 columns
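Note that label encoding imposes an arbitrary numeric ordering on multi-level categories, which can distort correlation values. As an alternative sketch, one-hot encoding avoids any implied order at the cost of extra columns (this would run on the merged frame in place of the label-encoding cell above):
# Alternative sketch: one-hot encode the object columns instead
bonfire_ohe = pd.get_dummies(bonfire, columns=object_columns, drop_first=True, dummy_na=True)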
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Define your target variables
target_variables = ['daily_yoy_ndt.total_inside_sales', 'daily_yoy_ndt.total_food_service', 'diesel_x', 'unleaded']
# Initialize a DataFrame to store correlation results
correlation_results = pd.DataFrame(index=bonfire.columns, columns=target_variables)
# Calculate the correlation matrix once, then store the absolute correlations
# for each target (assignment aligns on the index, so sorting here is unnecessary)
corr_all = bonfire.corr(numeric_only=True)
for target in target_variables:
correlation_results[target] = corr_all[target].abs()
# Create horizontal bar plots for the top 5 contributing features for each target variable using Matplotlib
colors = plt.cm.viridis(np.linspace(0, 1, 6)) # Create a color palette
for target in target_variables:
top_contributors = correlation_results[target].nlargest(6) # Get the top 5 contributors + the target itself
top_contributors = top_contributors.drop(target) # Drop the target variable itself
# Create a horizontal bar plot
plt.figure(figsize=(8, 4))
plt.barh(top_contributors.index, top_contributors.values, color=colors)
# Customize the appearance
plt.xlabel('Correlation (Absolute Value)')
plt.ylabel('Feature')
plt.title(f'Top 5 Contributing Features for {target}')
plt.gca().invert_yaxis() # Invert the y-axis to show the top contributor at the top
# Show the plot
plt.show()
The plots above illustrate the primary factors influencing each of our sales metrics: diesel, unleaded, food service, and inside sales. The key observations are summarized below.
Results
Our exploratory data analysis surfaced several findings relating store sales to store features and demographic factors. The key points are as follows:
There is a strong positive correlation between the presence of bonfire_grill and pizza and food service sales, suggesting that these two offerings complement each other and that stores carrying both may see higher food service revenue.
As the employee-to-population ratio increases, store sales tend to increase as well. This suggests that sites in employment-dense areas, with more daytime traffic, may generate higher sales.
Overall, our target variables are influenced by many factors, such as holidays and the day of the week (e.g., Friday versus Sunday); the findings from this analysis will be used to train and fine-tune our forecasting model in the next step.
Next steps
Before we jump straight into model building, we will finish cleaning the data by addressing null values and factorizing the datasets. We will then split our merged data into training and testing sets using a 70/30 ratio.
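As a concrete sketch of that split (assuming scikit-learn is available; for the time-series targets a chronological cut-off would be the safer choice, since a random split leaks future days into training):
from sklearn.model_selection import train_test_split
# Sketch: random 70/30 split of the merged frame
train_df, test_df = train_test_split(bonfire, test_size=0.3, random_state=42)
print(train_df.shape, test_df.shape)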
We will utilize the findings from our exploratory data analysis when running baseline forecasting models using current industry-standard methods, such as a naïve (and seasonal naïve) baseline and regression models. We will compare model performance to determine which approach yields the most promising results; from there, we can finalize the data used to train the model and the type of model we will use.
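For reference, a minimal sketch of a seasonal naïve baseline on the merged data, assuming the engineered Year and DayOfYear columns for ordering: each day's inside sales are "forecast" with the value from the same weekday one week earlier, per site.
# Seasonal naïve baseline: lag inside sales by 7 days within each site
sales_col = 'daily_yoy_ndt.total_inside_sales'
bonfire = bonfire.sort_values(['site_id_msba', 'Year', 'DayOfYear'])
bonfire['naive_7d'] = bonfire.groupby('site_id_msba')[sales_col].shift(7)
# MAPE over rows with a valid lag and non-zero actual sales
mask = bonfire['naive_7d'].notna() & (bonfire[sales_col] > 0)
mape = ((bonfire.loc[mask, sales_col] - bonfire.loc[mask, 'naive_7d']).abs() / bonfire.loc[mask, sales_col]).mean() * 100
print(f'Seasonal naïve MAPE: {mape:.1f}%')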
As we progress with the analysis, we will continuously revisit the business problem and address any gaps in our findings.